Yukun Zheng, Yuxin Wen and Ziyue Su
The internet is a necessity in our daily lives and facilitates the world in many ways, from communicating and sharing information with people, to online banking and shopping. Many people even view the internet as a utility like water, electricity and gas. However, do you know how many households in each state of the United States do not have internet access, who these people are, and why they lack it?
With these questions in mind, we will analyze households with no internet access in the United States and deduce the potential causes. According to Kaggle, the U.S. Census Bureau began asking about internet use in the American Community Survey in 2013; the dataset contains data for counties with populations over 65,000.
Throughout this tutorial, we will collect, process and visualize the data in this dataset, and apply machine learning techniques to it. We aim to uncover trends between households without an internet connection and potential factors such as age, location and income. We hope to inform the public that many households still lack internet access today.
We will need the following libraries for this project:
We highly recommend the following resources for more information about pandas, installation, and Python 3.6 in general:
!pip install folium
!pip install graphviz
!conda install graphviz -y
!pip install pydotplus
import sqlite3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
from sklearn import datasets
from sklearn import preprocessing
from sklearn.preprocessing import LabelEncoder
from sklearn import linear_model
from sklearn.linear_model import LinearRegression
from sklearn import tree
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import accuracy_score
from sklearn.metrics import r2_score
from sklearn.metrics import explained_variance_score
from scipy import stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import folium
from folium.plugins import HeatMap
from numpy.polynomial.polynomial import polyfit
This is the data collection stage of the data life cycle. Throughout this step, we primarily focus on collecting data from valuable sources.
Here we load the data from the local CSV file “kaggle_internet.csv” using pandas and store it in a dataframe.
Kaggle has arranged the data into different columns as follows:
We will use pandas.DataFrame to hold the data. This two-dimensional, table-like data structure is convenient because its columns can hold different types of objects, and it comes with many powerful functions for processing data. More info: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html.
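As a toy illustration (the column names here are made up, not the Kaggle columns), a DataFrame holds mixed column types side by side and can drop a column in one call:

```python
import pandas as pd

# string and integer columns coexist in one two-dimensional structure
df = pd.DataFrame({"county": ["A", "B"], "population": [70000, 91000]})

# drop with axis=1 removes a column and returns a new DataFrame
smaller = df.drop("population", axis=1)
print(list(smaller.columns))  # ['county']
```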
We will manipulate and analyze the data later throughout this tutorial. The head of the original dataset is printed below:
# SCRAPE DATA FROM THE LOCAL CSV FILE
data = pd.read_csv("kaggle_internet.csv")
data.index = data.index + 1
data.head()
Below we will take you through data tidying, a critical step that structures the dataset and facilitates our later analysis.
This is the data processing stage of the data life cycle. During this phase, we essentially “fix up” the organization of our data so that it is easier to read and analyze.
You can get more information regarding data cleaning and tidying at: https://realpython.com/python-data-cleaning-numpy-pandas/.
Here we reformat our dataset to match the formatting provided by Kaggle: we add the percentage of each population in new “percent_something” columns and then drop the original population columns.
Performing this step in the data life cycle is crucial in ensuring facilitated manipulation and flexibility for future analysis.
# ADD THE NEW COLUMN "PERCENT_SOMETHING" TO GET THE PERCENTAGE OF POPULATION
# ALSO DROP THE ORIGINAL COLUMN
temp = data.columns.values[6:18]
for i in temp:
    temp_str = i[2:]
    # GET THE PERCENTAGE OF EACH TYPE OF PEOPLE
    data["percent_" + temp_str] = 100 * data[i] / data["P_total"]
    # DROP THE ORIGINAL COLUMN
    data.drop(i, axis=1, inplace=True)
data.head()
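To make the count-to-percentage step concrete, here is a toy sketch of the same transformation on a single column (the column names are made up):

```python
import pandas as pd

# toy counts mimicking a "P_<something>" column and the total population
df = pd.DataFrame({"P_total": [100, 200], "P_child": [20, 50]})

# same transformation as the loop: percentage column in, count column out
df["percent_child"] = 100 * df["P_child"] / df["P_total"]
df = df.drop("P_child", axis=1)
print(df["percent_child"].tolist())  # [20.0, 25.0]
```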
As we can see, there is missing data in the dataset. We regard the type of missingness as missing at random (MAR), meaning the probability of a value being missing depends on the observed data but not on the unobserved data. We infer it is missing at random because we believe some people would refuse to answer questions about their age, household income or education level.
We will tidy the missing data by mean substitution, imputing the average of the observed cases for each missing value of a variable. More specifically, we will replace each NaN value with the mean of its column, which is equivalent to calling df.fillna with the column means.
For more information about handling the missing data you should refer to: https://machinelearningmastery.com/handle-missing-data-python/.
The head of the resulting tidy dataset is printed below:
# REPLACE THE NAN DATA WITH THE MEAN OF THE COLUMN (MEAN SUBSTITUTION)
# BY USING DF.FILLNA BY MEAN TO FILL THE NAN DATA
temp_mean = data.mean()
for j in data.columns.values[6:18]:
    temp = []
    for i in data[j]:
        if i != i:  # NaN is the only value not equal to itself
            temp.append(temp_mean[j])
        else:
            temp.append(i)
    data[j] = temp
data.head()
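The loop above is the manual route; pandas can perform the same mean substitution in a single fillna call. A toy sketch (the column name is illustrative):

```python
import numpy as np
import pandas as pd

# a small frame with one missing value
df = pd.DataFrame({"median_age": [30.0, np.nan, 40.0]})

# fillna with the column means replaces each NaN by its column's mean
df_filled = df.fillna(df.mean())
print(df_filled["median_age"].tolist())  # [30.0, 35.0, 40.0]
```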
This is the exploratory analysis and visualization stage of the data life cycle where we attempt to plot our data in order to observe potential trends. We will also perform statistical analysis in this step in order to obtain better supporting evidence for trends that may be discovered.
First, the map below pinpoints the locations of counties at different Gini index levels. According to Wikipedia, the Gini index is a measure of the income or wealth distribution of a place’s residents and is the most commonly used measure of inequality. There are 820 counties in our dataset; the blue dots represent counties with a Gini index less than or equal to 0.4, the red dots counties with a Gini index between 0.4 and 0.5, and the black dots counties with a Gini index greater than 0.5.
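The three-way color rule used on this map can equivalently be written with pd.cut; a small sketch on made-up Gini values, one per bucket:

```python
import pandas as pd

# made-up Gini values, one falling in each bucket
gini = pd.Series([0.35, 0.45, 0.55])

# bins are right-inclusive, mirroring the <= 0.4 and <= 0.5 thresholds
colors = pd.cut(gini, bins=[0, 0.4, 0.5, 1.0], labels=["blue", "red", "black"])
print(colors.tolist())  # ['blue', 'red', 'black']
```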
map_osm = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
for i, r in data.iterrows():
    color = "black"
    if r["gini_index"] <= 0.4:
        color = "blue"
    elif r["gini_index"] <= 0.5:
        color = "red"
    folium.Circle(radius=10, location=[r["lat"], r["lon"]], popup="",
                  color=color, fill=True).add_to(map_osm)
map_osm
The map below shows the counties with the highest and lowest Gini indices. The 30 highest are labeled red and the 30 lowest green. As the map shows, coastal areas have higher Gini indices than other areas of the United States, indicating a larger gap between the wealthy and the poor. Hover over a tag to see that county’s Gini index!
#map the top 30 highest and lowest Gini index counties and observe their location
#put your mouse over the tag to see its gini index
data.sort_values("gini_index", inplace=True, ascending=False)
data.head()
map1 = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
index = 1
for i, r in data.iterrows():
    a = r['lat']
    b = r['lon']
    tooltip1 = str(r['gini_index'])
    if 1 <= index <= 30:
        folium.Marker([a, b], popup='', icon=folium.Icon(color='red'),
                      tooltip=tooltip1).add_to(map1)
    if 790 <= index <= 820:
        folium.Marker([a, b], popup='', icon=folium.Icon(color='green'),
                      tooltip=tooltip1).add_to(map1)
    index = index + 1
map1
On the other hand, the map below shows the counties with the highest and lowest median ages. The 30 highest are labeled red and the 30 lowest green. We can see that coastal areas (especially Florida) have the highest median ages. Hover over a tag to see that county’s median age!
#pay attention to bottom right
data.sort_values("median_age", inplace=True, ascending=False)
data.head()
map1 = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
index = 1
for i, r in data.iterrows():
    a = r['lat']
    b = r['lon']
    tooltip1 = str(r['median_age'])
    if 1 <= index <= 30:
        folium.Marker([a, b], popup='', icon=folium.Icon(color='red'),
                      tooltip=tooltip1).add_to(map1)
    if 790 <= index <= 820:
        folium.Marker([a, b], popup='', icon=folium.Icon(color='green'),
                      tooltip=tooltip1).add_to(map1)
    index = index + 1
map1
The map below displays the counties with the highest and lowest percentages of households with no internet access. The 30 highest are labeled red and the 30 lowest green. The map suggests that the northern United States has a lower percent_no_internet than the south, and we see no clear correlation between median_age and percent_no_internet. Hover over a tag to see that county’s percentage!
data.sort_values("percent_no_internet", inplace=True, ascending=False)
data.head()
map1 = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
index = 1
for i, r in data.iterrows():
    a = r['lat']
    b = r['lon']
    tooltip1 = str(r['percent_no_internet'])
    if 1 <= index <= 30:
        folium.Marker([a, b], popup='', icon=folium.Icon(color='red'),
                      tooltip=tooltip1).add_to(map1)
    if 790 <= index <= 820:
        folium.Marker([a, b], popup='', icon=folium.Icon(color='green'),
                      tooltip=tooltip1).add_to(map1)
    index = index + 1
map1
The heatmap below gives a graphical representation of households that lack an internet connection. We can see that a great many of these households are in the east.
#simple heatmap
#zoom in
map1 = folium.Map(location=[39.29, -97.61], zoom_start=3.5)
# each point is [lat, lon, weight]; weighting by percent_no_internet makes
# the heat reflect households without internet rather than county density alone
x = data[['lat', 'lon', 'percent_no_internet']].values.tolist()
HeatMap(x).add_to(map1)
map1
During this phase of the data life cycle, we attempt to perform various modeling techniques such as linear regression to obtain a predictive model of our data. This allows us to predict values for data outside of the scope of our data.
Please visit http://bigdata-madesimple.com/how-to-run-linear-regression-in-python-scikit-learn/ for more information on machine learning techniques.
Here we will use numpy.polynomial.polynomial to investigate the relationship between income inequality, poverty, and households not having internet access.
The first plot shows the correlation across the entire nation, while the rest show the correlation within each state. The majority of the graphs below indicate a positive correlation between gini_index and percent_no_internet, so it is highly possible that areas with larger disparities between the rich and poor have more households lacking an internet connection.
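Note that numpy.polynomial.polynomial.polyfit returns coefficients lowest order first, which is why the code below unpacks them as b (intercept) then m (slope). A quick sketch on a toy line, not the dataset:

```python
import numpy as np
from numpy.polynomial.polynomial import polyfit

# toy points on the known line y = 2 + 3x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 2.0 + 3.0 * x

# degree-1 fit: coefficients come back lowest order first (intercept, slope)
b, m = polyfit(x, y, 1)
print(b, m)
```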
# the plot of gini_index vs. percent_no_internet
plt.plot(data["percent_no_internet"], data["gini_index"], "ro")
b, m = polyfit(data["percent_no_internet"], data["gini_index"], 1)
plt.plot(data["percent_no_internet"], b + m * data["percent_no_internet"], '-')
plt.title("gini_index vs. percent_no_internet")
plt.ylabel('gini_index')
plt.xlabel('percent_no_internet')
plt.show()
for i in data["state"].unique():
    temp = data.loc[data["state"] == i]
    regression = linear_model.LinearRegression()
    # reshape each series into a column vector for sklearn
    gini = np.transpose([temp['gini_index'].tolist()])
    no_internet = np.transpose([temp['percent_no_internet'].tolist()])
    regression = regression.fit(gini, no_internet)
    plt.plot(gini, regression.predict(gini))
    plt.scatter(temp["gini_index"], temp["percent_no_internet"], s=np.pi*6, alpha=0.5)
    plt.title(i)
    plt.ylabel('percent_no_internet')
    plt.xlabel('gini_index')
    plt.show()
    print('Coefficient:', regression.coef_[0][0])
This plot reveals a positive correlation between gini_index and percent_below_poverty, implying that areas with larger gaps between the rich and poor have higher rates of households living below the poverty line.
We already know that percent_no_internet correlates with gini_index positively, and since percent_below_poverty also correlates with gini_index positively, it is highly likely that poverty is one of the many factors that leads to households not having internet access.
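Visual trends like these can also be checked numerically with scipy.stats (already imported above); a minimal sketch on made-up values standing in for the two columns:

```python
from scipy import stats

# toy series standing in for gini_index and percent_below_poverty
gini = [0.40, 0.42, 0.45, 0.48, 0.50]
below_poverty = [10.0, 12.0, 15.0, 19.0, 22.0]

# pearsonr returns the correlation coefficient and a two-sided p-value
r, p_value = stats.pearsonr(gini, below_poverty)
print(r > 0)  # a positive coefficient confirms an upward trend
```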
plt.plot(data["gini_index"], data["percent_below_poverty"], "ro")
b, m = polyfit(data["gini_index"], data["percent_below_poverty"], 1)
plt.plot(data["gini_index"], b + m * data["gini_index"], '-')
plt.title("gini_index vs. percent_below_poverty")
plt.ylabel('percent_below_poverty')
plt.xlabel('gini_index')
plt.show()
Here we process the data so we can use it for machine learning later. In this step, we will handle the missing data by dropping the NaN values, and will investigate three potential factors: age, location and income.
# drop nan data; number of rows: 820 -> 605
data.dropna(axis=0, inplace=True)
# encode each feature column as integer codes with LabelEncoder
data['lon'] = LabelEncoder().fit_transform(data['lon'])
data['lat'] = LabelEncoder().fit_transform(data['lat'])
data['P_total'] = LabelEncoder().fit_transform(data['P_total'])
data['median_age'] = LabelEncoder().fit_transform(data['median_age'])
data['median_household_income'] = LabelEncoder().fit_transform(data['median_household_income'])
# scale the target, then encode it as integer codes
data['percent_no_internet'] = data['percent_no_internet'] * 100
data['percent_no_internet'] = LabelEncoder().fit_transform(data['percent_no_internet'])
X = data[["lon", "lat", "P_total",
          "median_age", "median_household_income"]].values
y = data["percent_no_internet"].values
y
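Because LabelEncoder is applied above to numeric columns, it is worth seeing exactly what fit_transform does: it maps each distinct value to an integer code based on sorted order. A toy sketch:

```python
from sklearn.preprocessing import LabelEncoder

# distinct values are sorted and numbered 0, 1, 2, ...
enc = LabelEncoder()
codes = enc.fit_transform([30.5, 28.1, 30.5, 42.0])
print(codes.tolist())  # [1, 0, 1, 2]
```

Note that encoding continuous features this way preserves their order but discards the spacing between values.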
Here we use the r2 score and the explained variance score to evaluate the accuracy of our models; the best possible value for each is 1. The results are as follows:
The four numbers we get are all very close to 1, which suggests our models fit the data closely. Note, however, that these scores are computed on the full dataset, including the data the models were trained on, so they are likely optimistic estimates of out-of-sample accuracy.
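cross_validate (imported earlier) offers a standard way to score models only on held-out folds; a sketch on synthetic stand-in data (X_demo and y_demo are made up here, not the county data):

```python
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeRegressor

# synthetic stand-in features and target
rng = np.random.RandomState(0)
X_demo = rng.rand(100, 5)
y_demo = X_demo @ np.array([1.0, 2.0, 0.5, 0.0, 3.0]) + 0.01 * rng.randn(100)

# each fold's model is scored only on its held-out data, so these r2 values
# are a fairer estimate of accuracy than scoring on the training set
scores = cross_validate(DecisionTreeRegressor(random_state=0), X_demo, y_demo,
                        cv=10, scoring="r2")
print(len(scores["test_score"]))  # one r2 value per fold
```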
kf = KFold(n_splits=10)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf = tree.DecisionTreeRegressor()
    clf = clf.fit(X_train, y_train)
    rf = RandomForestRegressor()
    rf.fit(X_train, y_train)
# use the r2 score to evaluate the accuracy of the models
# (best possible r2 score is 1.0); also use explained_variance_score
y_prediction1 = clf.predict(X)
print(r2_score(y, y_prediction1))
print(explained_variance_score(y, y_prediction1))
y_prediction2 = rf.predict(X)
print(r2_score(y, y_prediction2))
print(explained_variance_score(y, y_prediction2))
Below is a visualization of the decision tree:
# clf tree viz
# this can take a long time to run
from io import StringIO  # sklearn.externals.six was removed in newer sklearn
from IPython.display import Image
from sklearn.tree import export_graphviz
import pydotplus
dd = StringIO()
export_graphviz(clf, out_file=dd, filled=True, rounded=True,
                feature_names=["lon", "lat", "P_total",
                               "median_age", "median_household_income"])
viz1 = pydotplus.graph_from_dot_data(dd.getvalue())
Image(viz1.create_png())